Interpretable Differential Diagnosis of Non-COVID Viral Pneumonia, Lung Opacity and COVID-19 Using Tuned Transfer Learning and Explainable AI

The coronavirus epidemic has spread to virtually every country on the globe, inflicting enormous health, financial, and emotional devastation, as well as the collapse of healthcare systems in some countries. An automated system that allows fast detection of COVID-19 infection could be highly beneficial to healthcare services and people around the world. Molecular or antigen testing along with radiological X-ray imaging is currently used in clinics to diagnose COVID-19. Nonetheless, given surges in coronavirus cases and the overwhelming workload of hospital doctors, developing an AI-based automated COVID-19 detection system with high accuracy has become imperative. On X-ray images, distinguishing COVID-19 from non-COVID viral pneumonia and other lung opacity can be challenging. This research used artificial intelligence (AI) to deliver high-accuracy automated detection of COVID-19 versus normal chest X-ray images. The study was then extended to differentiate COVID-19 from normal, lung opacity, and non-COVID viral pneumonia images. We employed three pre-trained models, Xception, VGG19, and ResNet50, on a benchmark dataset of 21,165 X-ray images. Initially, we formulated COVID-19 detection as a binary classification problem, separating COVID-19 from normal X-ray images, and obtained 97.5%, 97.5%, and 93.3% accuracy for Xception, VGG19, and ResNet50, respectively. We then developed models for multi-class classification and obtained an accuracy of 75% for ResNet50, 92% for VGG19, and 93% for Xception. Although the performances of Xception and VGG19 were nearly identical, Xception proved more effective, with higher precision, recall, and F1 scores. Finally, we applied Explainable AI to each model, which adds interpretability to our study. Furthermore, we conducted a comprehensive comparison of the models' explanations, and the study revealed that Xception is more precise in indicating the actual features responsible for a model's predictions. This addition of Explainable AI will greatly benefit medical professionals, as they will be able to visualize how a model makes its prediction and will not have to trust our machine-learning models blindly.


Introduction
The novel coronavirus, taxonomically known as severe acute respiratory syndrome coronavirus 2 or SARS-CoV-2 and labeled by the World Health Organization (WHO) as COVID-19, appeared in Wuhan, Hubei Province, China, before the end of 2019 [1] and has sparked unprecedented global fear. The World Health Organization declared this a public health emergency of worldwide concern and dubbed it a global pandemic due to the
• Three CNN models were developed for COVID-19 mass screening (two classes: COVID positive and COVID negative) from chest X-ray images. Afterward, explainable AI was applied to demystify the black box of the models.
• Three additional multi-class models were constructed to diagnose non-COVID, COVID, lung opacity, and non-COVID viral pneumonia from chest X-ray radiographs. Furthermore, explainable AI was used to validate and explain the performance of each multi-class model and to demystify the black box of individual CNN layers.
• We presented a comprehensive performance study of the proposed binary and multi-class systems in terms of the confusion matrix, accuracy, sensitivity, specificity, and F1-score. Additionally, we compared the test accuracy of the two-class CNN models for different training input image resolutions and investigated the impact of input image size on the models' accuracy.

Background Study
Due to global demand and the advent of Artificial Intelligence (AI), a considerable body of research has evolved that applies AI to diagnose different respiratory disorders, particularly those directly related to COVID, using basic X-ray (XR) and CT images. However, many of these studies focus on network design modifications and accuracy rather than on explaining the models, and most models were trained on small data sets. L. Wang et al. presented COVID-Net, a convolutional neural network [20], to detect COVID-19 cases among roughly fourteen thousand chest X-ray images. However, the proposed model reached only 83.5% accuracy. Khan et al. [21] applied transfer learning to 310 normal, 330 bacterial pneumonia, 327 non-COVID viral pneumonia, and 284 COVID-19 pneumonia images and obtained 89.5% accuracy. However, because the models in that paper were trained on a small number of images, they require more detailed analysis, and Explainable AI was not used to show model efficacy. Y. Oh, S. Park and J. Ye trained a ResNet18 model and showed explainability [22], but the paper used a small number of images for training and testing and obtained an accuracy of 89%; explainability in each layer of the model is also desirable. T. Ozturk et al. [23] constructed a Darknet framework trained with only 127 COVID images and reached 87% accuracy, but model explainability was not explored. A. Altan and S. Karasu [24] trained EfficientNet B and obtained a promising 99% accuracy, but the model was trained with only 219 COVID images, and Explainable AI was not implemented. J. Civit-Masot et al. [25] trained VGG16 with 132 COVID images and obtained an average F1 (avgF1) score of 85%, which is too low to be promising in a real clinical scenario. Inception v3, InceptionResNetv2, and ResNet50 were trained by A. Narin, C. Kaya and Z. Pamuk [26], who obtained a promising accuracy of 98%, but those models were trained with only 50 normal and 50 COVID images, and the explainability of the models was not shown. Because COVID-19 images are scarce, some methods have focused on generating synthetic data in order to train better models. To create synthetic images, a Generative Adversarial Network (GAN) has been used [27]. The results showed that data augmentation boosted accuracy on the VGG16 network from 85% to 95% [28]. J. Arias-Londono, J. Gomez-Garcia, L. Moro-Velazquez and J. Godino-Llorente trained the COVID-Net model with a large set of images but obtained 91.53% accuracy [29]. Before a model can be used in clinics, it must undergo extensive analysis. Summary results from different letters are given in Table 1.
To address these recurrent issues, we built three transfer-learning-based neural networks in this study to classify COVID-19 using larger data sets, showing promising accuracy on unseen data. Likewise, we leveraged Explainable AI to demystify the black box of the three models and to illustrate how the system identifies COVID from X-ray images across all blocks and layers by combining heatmaps with the original images. Our explainability analysis not only serves as a responsible and transparent audit of our models but may also assist physicians in improving COVID-19 screening. Moreover, we trained all three models using images with different input resolutions and found that increasing the image resolution during training improves test accuracy. Additionally, to address the problems associated with diagnosing patients with non-COVID viral pneumonia and lung opacity, we constructed a multiclass model (four classes), validated its accuracy, and explained it using Explainable AI.

Methodology
In this section, the methodology is presented in the following order: collecting data to train the neural networks, image preprocessing, the experiments and training of the neural networks, and the evaluation process.

Dataset
The study was conducted using an X-ray (XR) image dataset comprising Posterior-Anterior (PA) and Anterior-Posterior (AP) views. This study looked at two different conditions: normal and COVID-19. The majority of the data for this investigation came from Kaggle's COVID-19-radiography-database [33][34][35][36][37][38][39][40]. The COVID-19-radiography-database was developed by combining different databases [41,42]. It includes images from the COVID-19 DATABASE of the Italian Society of Medical and Interventional Radiology (SIRM) [41], the Novel Corona Virus 2019 Dataset by Joseph Paul Cohen, Paul Morrison, and Lan Dao [42], images from 43 articles (the radiography metadata contains the references), and normal and non-COVID viral pneumonia images from the Chest X-ray Images (pneumonia) database. The database includes 3616 COVID-19 positive cases along with 10,192 normal cases, 6012 Lung Opacity (non-COVID lung infection) images, and 1345 non-COVID viral pneumonia images. Figure 1 depicts two sample instances covering the COVID and normal classes. Figure 2 shows sample images used for multiclass classification.
As stated above, we considered four classes. The three pathological classes are described below.

COVID-19
COVID-19 is a potentially deadly disease caused by the SARS-CoV-2 virus. Most infected people experience a mild to moderate respiratory illness and recover without requiring special treatment. Nevertheless, some individuals become seriously ill and require medical attention. Elderly people and people with underlying medical conditions are more likely to develop serious illness. Anyone who contracts COVID-19 can become critically sick or die, at any age.

Non-COVID Viral Pneumonia
Non-COVID viral pneumonia is an infection of the lungs caused by a virus. These viruses usually reside in the upper part of the respiratory system. Problems arise, however, when they reach an individual's lungs. The airways in the lungs then deteriorate, swell, and fill with fluid. Any factor that weakens the immune system can make an individual more susceptible to developing pneumonia.

Lung Opacity
Opacity is an umbrella term. Any area that preferentially attenuates the X-ray beam and therefore appears more opaque than the surrounding area is referred to as an opacity. It is a general term that does not specify the size or pathological nature of the abnormality [43]. In other words, any region of the chest radiograph that is whiter than it should be is referred to as an opacity. The lungs normally contain a large amount of air. When a person has pneumonia, substances such as fluid, germs, and immune system cells replace the air in the lungs. Because of this, regions with opacities appear grey where they should be darker. When we observe them, we know that the lung tissue in that region is probably unhealthy. Lung opacity covers a variety of other lung conditions, such as lung cancer, pulmonary edema, and pulmonary embolism.

Image Processing
Before being fed to the neural networks, the X-ray images with posterior-anterior (PA) and anterior-posterior (AP) views were rescaled. We developed six models in total. Three are used to classify normal and COVID images; the remaining three are used for multiclass classification, to determine whether an image shows a normal, COVID, lung opacity, or non-COVID viral pneumonia case.
We carried out two investigations for binary classification. In the first, the data were normalized and scaled to 224 by 224 pixels for the ResNet50 and VGG19 models and, following the standard specification, to 299 by 299 pixels for the Xception model. In the second, we scaled all images to 512 × 512 pixels for the three two-class models. We randomly selected 1000 COVID-19 positive images and 1000 normal images from the 3616 COVID-19 positive and 10,192 normal images in the database. A random 80/20 split was used to form the train and test sets. Moreover, 20% of the train split was used as a validation set to help prevent overfitting. For the multiclass setup, we used 1000 images each of the COVID, normal, lung opacity, and non-COVID viral pneumonia classes. The images were resized to each model's standard input size: 224 by 224 pixels, with normalization, for ResNet50 and VGG19, and 299 by 299 pixels for Xception.
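As a rough illustration of the split described above, the following sketch shuffles image indices, holds out 20% for testing, and then reserves 20% of the remaining training portion for validation. The function name `split_indices`, the fixed seed, and the use of plain index lists are assumptions made for this example; the text does not say how the split was implemented.

```python
import random


def split_indices(n_images, test_frac=0.2, val_frac=0.2, seed=0):
    """Shuffle indices, hold out test_frac for testing, then
    reserve val_frac of the remaining images for validation."""
    indices = list(range(n_images))
    random.Random(seed).shuffle(indices)  # seed is an arbitrary choice

    n_test = int(round(n_images * test_frac))
    test = indices[:n_test]
    remaining = indices[n_test:]

    n_val = int(round(len(remaining) * val_frac))
    val = remaining[:n_val]
    train = remaining[n_val:]
    return train, val, test


# Two-class setup: 1000 COVID + 1000 normal images = 2000 images.
train_idx, val_idx, test_idx = split_indices(2000)
print(len(train_idx), len(val_idx), len(test_idx))  # 1280 320 400
```

With 2000 images this yields 1280 training, 320 validation, and 400 test images; the same procedure applied to the 4000 multiclass images would give 2560, 640, and 800.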
The dataset was also standardized using Z-normalization. In machine learning methods, standardization helps to stabilize the model while also speeding up training. Z-normalization is applied as in (1):
$$ z_i = \frac{x_i - \mu_i}{\sigma_i} \tag{1} $$
Here, x_i is the value of feature i, µ_i is the mean value, and σ_i is the standard deviation of the feature.
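The Z-normalization step can be sketched as follows. This is a minimal example that assumes the mean and the population standard deviation are computed over a single list of values; the helper name `z_normalize` and the sample pixel values are hypothetical, and the text does not state over which axis the statistics were computed.

```python
import statistics


def z_normalize(values):
    """Subtract the mean and divide by the standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    if sigma == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


pixels = [0, 64, 128, 192, 255]  # hypothetical intensity values
z = z_normalize(pixels)
print([round(v, 3) for v in z])
```

After this transformation the values have zero mean and unit standard deviation, which is the property that helps stabilize training.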
In this letter, we trained, validated, and tested six different convolutional neural networks on the collected chest X-ray data. The Xception, ResNet50, and VGG19 models for both the binary and multiclass setups were adjusted slightly in the last two layers for regularization.

Neural Network Models
For both the binary and multiclass setups, three different convolutional neural networks were trained, validated, and tested on the collected chest X-ray data. The Xception, VGG19, and ResNet50 models were trained on Google's Colab Pro hardware with 26.3 GB of system RAM and 16,160 MB of GPU RAM. Pre-trained "image-net" weights were used to speed up learning, and the final layers of the convolutional neural network models were slightly modified for regularization. Class imbalance in the training data is addressed by applying a weighted categorical loss function. The models are compiled with the Adam optimizer using its default parameters for computational efficiency and fast convergence. Early stopping with a patience of 15 is used to monitor the validation loss and stop training once the validation loss stops improving. The learning rate was reduced by a factor of 0.25, with a patience of 15, whenever the validation loss hit a plateau. If the model's performance does not improve for 15 consecutive epochs, training is terminated.
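The text does not give the form of the weighted categorical loss, so the following is only a sketch under a common assumption: each class is weighted inversely to its frequency (total / (number of classes × class count)), and the per-sample loss is the weighted negative log-probability of the true class. The function names are hypothetical; the class counts are those reported for the database.

```python
import math


def class_weights(counts):
    """Inverse-frequency weights: total / (n_classes * n_c). Assumed scheme."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {c: total / (n_classes * n) for c, n in counts.items()}


def weighted_cross_entropy(probs, true_class, weights, eps=1e-12):
    """Per-sample loss: -w_c * log(p_c) for the true class c."""
    p = max(probs[true_class], eps)  # guard against log(0)
    return -weights[true_class] * math.log(p)


counts = {"COVID": 3616, "Normal": 10192,
          "Lung Opacity": 6012, "Viral Pneumonia": 1345}
w = class_weights(counts)

# The rarest class receives the largest weight.
print({c: round(v, 3) for c, v in w.items()})
```

Under this scheme a mistake on a rare class, such as viral pneumonia, costs more than the same mistake on the normal class, which counteracts the imbalance.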
The models were trained for 200 epochs with a mini-batch size of 16 images, and the confusion matrix, the resulting ROC curve, and the evaluation metrics were produced. Furthermore, Grad-CAM is used to build heatmaps in several layers. To identify the key areas in the image used to predict the pathological state, and to support model interpretability, the input image and the heatmap are overlaid. The complete system diagram used in this study is depicted in Figure 3.
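The interplay of the two patience-based rules can be sketched in plain Python. This toy simulation assumes that the "factor of 0.25" step is a learning-rate reduction on plateau and that the initial learning rate is 0.001; both are assumptions, as are the function name `run_schedule` and the synthetic loss curve. It is not the training code used in the study.

```python
def run_schedule(val_losses, lr=1e-3, factor=0.25,
                 lr_patience=15, stop_patience=15):
    """Replay a validation-loss curve and report when training would stop."""
    best = float("inf")
    lr_wait = 0    # epochs without improvement since the last LR change
    stop_wait = 0  # epochs without improvement since the best loss
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            lr_wait = 0
            stop_wait = 0
        else:
            lr_wait += 1
            stop_wait += 1
        if lr_wait >= lr_patience:
            lr *= factor  # reduce the learning rate on plateau
            lr_wait = 0
        if stop_wait >= stop_patience:
            return epoch, lr  # early stopping triggers
    return len(val_losses), lr


# Synthetic curve: three improving epochs, then a flat plateau.
losses = [1.0, 0.8, 0.6] + [0.6] * 20
stopped_epoch, final_lr = run_schedule(losses)
print(stopped_epoch, final_lr)
```

With both patience values set to 15, the learning-rate reduction and the early stop fire on the same epoch (epoch 18 in this example), so the reduced rate is never actually used; a shorter learning-rate patience would give the lower rate a chance to help before training ends.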

VGG19
In the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2014, VGGNet [44] performed exceptionally well, taking first place in image localization and second place in image classification. Figure 4 shows the VGG19 used in our classification experiment, which consists of 16 convolution layers, 5 max-pooling layers, one average pooling layer, a flattening layer, and a dense layer. The VGG CNN has six major blocks, which make the network deep, using small 3 × 3 filters with a stride of 1 and same padding for the convolution layers, and 2 × 2 filters with a stride of 2 for the max-pooling (downsampling) layers. In our tailored VGG19 network, we froze the weights up to the max-pooling layer of the conv5 block and then added one average pooling layer, a flattening layer, and a dense layer to fine-tune the model.

Resnet50
ResNet [45] has emerged as a ground-breaking deep neural network (DNN) model for computer vision problems. It made its debut in 2015 when it won the ImageNet [46] competition.
In most situations, DNNs outperform neural networks (NNs) with fewer layers. However, training a very deep NN is notoriously affected by the vanishing gradient problem, which causes model performance to deteriorate. ResNet overcomes this issue with identity shortcut connections, or skip connections, that bypass one or more layers. The residual block, which consists of a main path and an identity shortcut, is shown in Figure 5. The main path consists of a sequence of layers, whereas the second path, the skip connection, is a direct path from the block's input to its output. The skip connection adds the input directly to the output of the main path, so the block outputs F(X) + X. In this way, the skip connection addresses the vanishing gradient problem associated with conventional deep neural networks.
We used the ResNet50 version of ResNet and modified its final stages slightly. We used the pre-trained "image-net" weights and froze the weights of the pre-trained layers to create a classifier for COVID and normal instances. Our proposed structure for ResNet50, which is used to classify chest X-rays, is shown in Figure 6. Additional layers are added on top of the pre-trained model: an average pooling layer with a pool size of (4, 4), a flattening layer, a dense layer with ReLU activation, a dropout layer that randomly drops 50% of the units to reduce overfitting, and a final dense layer.
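The F(X) + X computation of a residual block can be illustrated with a toy example. This sketch replaces the convolution and batch-normalization layers of a real ResNet block with two small dense layers, so it shows only the skip-connection arithmetic; all names and weights are hypothetical.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def linear(v, weights):
    """Tiny dense layer: each row of `weights` produces one output value."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]


def residual_block(x, w1, w2):
    """Main path F(x) = W2 * relu(W1 * x); the block returns relu(F(x) + x)."""
    f = linear(relu(linear(x, w1)), w2)
    return relu([fi + xi for fi, xi in zip(f, x)])


x = [1.0, 2.0]
zeros = [[0.0, 0.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]

# With zero weights the main path contributes nothing,
# so the input passes through the skip connection unchanged.
print(residual_block(x, zeros, zeros))        # [1.0, 2.0]
print(residual_block(x, identity, identity))  # [2.0, 4.0]
```

The first call shows why the shortcut helps very deep networks: even when the main path learns nothing useful, the block still passes its input (and its gradient) straight through.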

Xception
Xception [47], a modified version of Inception-v3, is a deep neural network architecture based on depthwise separable convolutions that was developed by Google researchers. In a depthwise convolution, each input channel is processed by its own single convolutional filter. A pointwise convolution uses a 1 × 1 kernel that iterates over every spatial position; the depth of this kernel equals the number of channels in its input. A depthwise separable convolution combines a depthwise convolution with a pointwise convolution.
In the original depthwise separable convolution, the pointwise convolution comes after the depthwise convolution, whereas in the modified depthwise separable convolution, the depthwise convolution comes after the pointwise convolution. The Xception model uses the modified separable convolutions. Our tweaked Xception model is depicted in Figure 7. In the figure, the Separable Convolutions are the modified depthwise separable convolutions, and there are residual connections in the middle flow. In the Xception model, the data first passes through the entry flow, then eight times through the middle flow, and finally through the exit flow. We used pre-trained "image-net" weights and tweaked a couple of the model's final layers for regularization. An average pooling layer with a pool size of 4 by 4, a flattening layer, and a dense layer are added to classify images.
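One practical consequence of splitting a convolution into depthwise and pointwise steps is a much smaller parameter count. The sketch below compares the two, ignoring bias terms; the 3 × 3 kernel with 728 input and output channels is chosen as an illustrative configuration and the function names are hypothetical.

```python
def standard_conv_params(k, c_in, c_out):
    """A k x k convolution mixes space and channels in a single step."""
    return k * k * c_in * c_out


def separable_conv_params(k, c_in, c_out):
    """Depthwise step (one k x k filter per input channel)
    plus pointwise step (1 x 1 convolution across channels)."""
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


k, c_in, c_out = 3, 728, 728  # illustrative layer size
standard = standard_conv_params(k, c_in, c_out)
separable = separable_conv_params(k, c_in, c_out)
print(standard, separable, round(standard / separable, 2))
```

For this configuration the separable form needs roughly one ninth of the weights of the standard convolution. The count is the same whichever of the two steps is applied first.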

GradCam
Grad-CAM, or gradient-weighted class activation mapping, is a method for identifying the areas of an image that are crucial for a particular classification decision. It is used to interpret the predictions of convolutional neural networks (CNNs), which are frequently employed for computer vision tasks. The process entails computing the gradients of the target class score with respect to the feature maps of the last convolutional layer and then using those gradients to weight each feature map according to its importance for the final decision. To highlight the areas that are crucial for the predicted class, the generated heatmap is superimposed over the original image. Grad-CAM has been widely used in computer vision to understand the internal representations learned by CNNs and to enhance the interpretability of these models. It uses the gradients of any target concept, flowing into the final convolutional layer, to create a coarse localization map that highlights the regions of the image that are important for predicting that concept.
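The weighting step described above can be sketched on plain nested lists. In this toy example the feature maps and their gradients are supplied by hand; in practice the gradients would come from automatic differentiation in a deep learning framework. The function name `grad_cam`, the final scaling to [0, 1], and the 2 × 2 inputs are assumptions made for illustration.

```python
def grad_cam(feature_maps, gradients):
    """feature_maps[k][i][j]: activation of map k at position (i, j).
    gradients[k][i][j]: d(class score) / d(feature_maps[k][i][j])."""
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])

    # Importance weight per map: global average of its gradients.
    alphas = [sum(sum(row) for row in grad) / (height * width)
              for grad in gradients]

    # Weighted sum of the feature maps.
    cam = [[0.0] * width for _ in range(height)]
    for alpha, fmap in zip(alphas, feature_maps):
        for i in range(height):
            for j in range(width):
                cam[i][j] += alpha * fmap[i][j]

    # ReLU keeps only regions with a positive influence on the class.
    cam = [[max(0.0, v) for v in row] for row in cam]

    # Scale to [0, 1] so the map can be rendered as a heatmap.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam


feature_maps = [[[1.0, 0.0], [0.0, 0.0]],   # map 0 fires top-left
                [[0.0, 0.0], [0.0, 1.0]]]   # map 1 fires bottom-right
gradients = [[[1.0, 1.0], [1.0, 1.0]],      # map 0 supports the class
             [[-1.0, -1.0], [-1.0, -1.0]]]  # map 1 opposes the class
heatmap = grad_cam(feature_maps, gradients)
print(heatmap)  # [[1.0, 0.0], [0.0, 0.0]]
```

Only the top-left position survives, because the map that fires there has a positive average gradient; the bottom-right activation is suppressed by the ReLU. The resulting coarse map would then be upsampled to the image size and overlaid on the X-ray.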

The Experiments
The three tweaked models were used in this investigation in two separate experimental setups.

Two Class Setup
In the two-class experimental setup, we used all three proposed models with a final dense layer of output size 2. We carried out two investigations with this setup.
In the first investigation, the models were trained with the standard input size required by each model: 299 × 299 pixels for Xception and 224 × 224 pixels for VGG19 and ResNet. In the second investigation, we first resized the input images to 512 × 512 pixels and then trained all of the proposed models on the resized images.
For both investigations, model performance is evaluated using data that the models did not see during training.

Multi/Four class Setup
In the multi-class experimental setup, we used all three proposed models with a final dense layer of output size 4. The Xception, VGG19, and ResNet models were trained with their standard input sizes: 299 × 299 for Xception and 224 × 224 for VGG19 and ResNet.

Evaluating Model Performances and Deep Layer Feature Investigation
We evaluated accuracy, recall (sensitivity), precision (also known as Positive Predictive Value, PPV), specificity, and F1 score to assess the performance of our algorithms for classifying X-ray images, as per (2), (3), (4), (5), and (6). Precision [48] is defined as the ratio of correct positive identifications to all positive identifications. A low precision results in a significant number of false positives, meaning patients are mistakenly categorized as having a certain condition. Recall [48] is the number of true positives divided by the number of true positives plus the number of false negatives, so it takes false negatives into account. A low recall means that cases of the condition are missed. The F1-score (6) is the harmonic mean of precision and recall. A high F1 score is desired in the classification task.
$$ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{2} $$
$$ \text{Recall}_i = \frac{TP_i}{TP_i + FN_i} \tag{3} $$
$$ \text{Precision}_i = \frac{TP_i}{TP_i + FP_i} \tag{4} $$
$$ \text{Specificity}_i = \frac{TN_i}{TN_i + FP_i} \tag{5} $$
$$ \text{F1}_i = \frac{2 \cdot \text{Precision}_i \cdot \text{Recall}_i}{\text{Precision}_i + \text{Recall}_i} \tag{6} $$
where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, and
• i = COVID and Normal for the classification problem.
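These metrics can be computed per class from a confusion matrix, as in the sketch below. It uses the standard one-vs-rest definitions; the function name `per_class_metrics` is hypothetical, and the confusion matrix holds made-up counts for illustration only, not results from this study.

```python
def per_class_metrics(confusion, labels):
    """confusion[t][p] = number of samples of true class t predicted as p."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[c][c] for c in range(len(labels)))
    results = {}
    for c, name in enumerate(labels):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                 # missed cases of class c
        fp = sum(row[c] for row in confusion) - tp  # other classes called c
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        results[name] = {"precision": precision, "recall": recall,
                         "specificity": specificity, "f1": f1}
    return correct / total, results


# Hypothetical counts, NOT taken from the paper's tables.
labels = ["COVID", "Normal"]
confusion = [[190, 10],
             [5, 195]]
accuracy, metrics = per_class_metrics(confusion, labels)
print(accuracy)
print({k: round(v, 4) for k, v in metrics["COVID"].items()})
```

The same function works unchanged for the four-class setup, since each class is scored against all the others combined.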
This study employed the gradient-weighted Class Activation Mapping (Grad-CAM) [49] technique to visualize the input regions that are significant for model predictions and to visualize heatmaps in the layers of three distinct CNN networks to help explain the models' performance. The complete Grad-CAM analysis method is shown in Figure 8. By visualizing the heatmap, it is often possible to show why the model decided on a particular class. We first passed an image through the model to produce a class prediction. We then computed the gradient of the score y^c for class c with respect to the feature map activations A^k of a convolutional layer. To derive the neuron importance weights α_k^c, these gradients flowing back are global-average-pooled across the width and height dimensions (indexed by i and j, respectively):
$$ \alpha_k^c = \frac{1}{Z} \sum_i \sum_j \frac{\partial y^c}{\partial A_{ij}^k} $$
where Z is the number of spatial positions in the feature map. The Grad-CAM map is then computed as:
$$ L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\left( \sum_k \alpha_k^c A^k \right) $$
By overlaying the heatmap on the input image, we produced the visualization. For the binary class, this letter examined the Grad-CAM visualization for all the Xception model layers and compared heatmaps across all the models to assess their efficacy and to explain which features lead a model to decide on a particular class. First, we analyzed different layers and blocks of a model using Grad-CAM to find which regions affect classification. Second, we compared the activation maps of the final convolution layer of the employed CNN models to demystify their performance. For the multiclass classification, we analyzed the heatmaps of the last CNN layer for the three models and compared them.

Result Analysis
The results of the experiments are presented in two parts. In the first, the results are analyzed statistically; in the second, feature extraction in different layers of the models is visualized using Grad-CAM and analyzed to explain why our models arrived at their COVID and non-COVID predictions.

Statistical Analysis
We assessed the model's performance using a separate test set that models were not exposed to during training. The models were evaluated statistically by computing the test precision, recall, F1-score (F1), accuracy (Acc), Positive Predictive Value (PPV), and the Area Under the ROC Curve (AUC).
The effectiveness of the three CNN networks in the two-class setting is summarized in Table 2, which also shows performance for two different input image sizes. Similarly, Figure 9 compares the performance of all three models at the standard image size required by each model and at 512 × 512 input images.
From Table 2, the accuracy of Xception and VGG19 is 97% when the models are trained at the standard image size required by each model, while ResNet50 provides 92.5% accuracy. The F1 scores for detecting normal and COVID images are 0.971 and 0.969 for Xception and 0.967 and 0.972 for VGG19, whereas for the ResNet model they are 0.923 and 0.972. Clearly, Xception and VGG19 outperform ResNet50 in terms of accuracy and F1 score. The highest sensitivity for detecting COVID among the three models is 97.9%, while a high precision of 97.7% is observed in VGG19 for the COVID class. For the normal class, recall is highest in VGG19 and precision is highest in the Xception model. For the 512 × 512 image size, accuracy is again higher in Xception and VGG19, both of which provide an accuracy of 97.5%. The F1 scores for the COVID and normal classes are 0.976 and 0.973 for the Xception model and 0.975 and 0.975 for the VGG19 model. We therefore conclude that training with an increased image size improves the accuracy and performance of the models. This may be because resizing to a lower resolution can discard vital information in the image that would otherwise contribute to the decision-making process. Figure 9 shows two bar graphs that compare the models' performance at different input resolutions during training. It is evident that ResNet performs poorly compared to VGG19 and Xception, and that training with increased image resolution helps the models achieve better performance.
Figure 9. Precision, recall, F1 score, and AUC bar chart comparison for three models (Xception, VGG19, and ResNet50): (a) models trained with the standard input image size; (b) models trained with 512 × 512 input images. ResNet performs poorly, and accuracy increases with image size.
Figure 10 shows the ROC curves per class for each model and the accompanying confusion matrices for models trained at the standard input image size, and the same for models trained with 512 × 512 input images. Consistent with the previous finding, training with higher-resolution images results in better performance of the trained models. The effectiveness of the three CNN networks is examined for both binary and multi-class classification. Figure 11 shows the findings of our training for binary classification, and the multi-class results are summarized in Table 3 and Figure 12. The Xception model outperforms VGG19 and ResNet50, with an accuracy of 93%. The F1 scores for Normal, COVID, Lung Opacity, and non-COVID viral pneumonia are 0.92, 0.92, 0.96 respectively. VGG19 is very close in performance, with 92% accuracy.

Model's Explainability and Interpretability
It is often possible to show why a model decided on a specific class using a heatmap. We qualitatively examined the regions of interest identified by the networks using Grad-CAM activation maps and by displaying heatmaps at different layers of the three proposed networks. To support interpretability and further analysis, the input image and its heatmap are superimposed to locate the regions of the image that are critical for predicting the pathological condition. Although Grad-CAM studies generally use only the last CNN layer, since it contains high-level information, we visualized all the network layers to study the learning process of the models. In this letter, we used the Grad-CAM approach to (1) visualize and compare the different layers of a model to identify the model's decision process, (2) identify the network sections that have the most significant impact on classification, and (3) compare the activation maps of the last convolution layer of the three CNN models to demystify their performance. For the Xception model, we used Grad-CAM to display the activation maps for all the CNN blocks for COVID and non-COVID class images. We also visualized the activation maps for all the layers within the same block of the Xception model. Furthermore, we applied explainable AI to the ResNet50 and VGG19 models and compared the activation maps to see which model should perform better in clinical conditions. For the four-class classification, the Grad-CAM analysis shows the last convolution layer output of the Xception model, since its accuracy is higher than that of VGG19 and ResNet50. Figure 13 depicts the visualization results for the separable convolution layers in Blocks 14, 12, 10, 8, 4 and 1 of the Xception model for a ground-truth normal image.
From the original X-ray images, heatmaps, and superimposed images, we can see that separable convolution block 14 has the most significant impact on image classification, as it captures the high-level, complex features. Block 14 builds up features by combining all the features detected in the earlier layers. Block 1, on the other hand, contributes only by detecting low-level features such as edges and colours. The deep-layer features mainly help to explain the success or failure of a deep learning network in a particular decision. In this case, the Xception model classified the image correctly as normal. Similarly, Figure 14 shows a comparable effect for an X-ray image with a ground-truth COVID diagnosis, indicating that Block 14 strongly influences the localization of the regions used for classification. In the last block's convolution layer, the model looks at the chest region, which is the region of interest for detecting COVID, and covers almost all of the COVID-affected area. The Xception model predicted this case correctly. In summary, the last block contributes more to the classification than the earlier blocks.
In Figure 13, the first row, images (a-f), shows a chest X-ray image from the normal class. The second row, images (g-l), shows the Grad-CAM activation maps for the X-ray image. The third row, images (m-r), shows the activation map overlaid on the X-ray image. The first column shows the input image, heatmap, and overlay image for block14 separable convolution 2 of the Xception model. The second, third, fourth, fifth, and sixth columns correspond to Block 12 sepconv2, Block 10 sepconv2, Block 8 sepconv2, Block 4 sepconv2, and Block 1 conv1, respectively. Block 14 detects high-level features and impacts decision-making, whereas Block 1 detects lower-level features such as colors and edges.
Figure 14. Grad-CAM assessment of COVID class images at different levels of the Xception model. The first row, images (a-f), shows a chest X-ray image from the COVID class. The second row, images (g-l), shows the Grad-CAM activation maps for the X-ray image. The third row, images (m-r), shows the activation map overlaid on the X-ray image. The first column shows the original image, heatmap, and overlay image for block14 separable convolution 2 of the Xception model. The second, third, fourth, fifth, and sixth columns correspond to Block 12 sepconv2, Block 10 sepconv2, Block 8 sepconv2, Block 4 sepconv2, and Block 1 conv1. Block 14 detects high-level features and influences decision-making, whereas Block 1 detects lower-level features like colors and edges.
To determine which layer within a convolution block is most decisive for the COVID and normal classes, we examined the heatmaps of the separable convolution layer, the batch normalization (bn) layer, and the activation (act) layer, together with the original and superimposed images, as shown in Figure 15. The first row displays COVID and normal chest X-ray images. The second row displays the GradCam activation maps for the X-ray images. The third row displays the superimposed images. The first three columns show the input image, heatmap, and overlay image of the block14 separable convolution 2, batch normalization, and ReLU activation layers of the Xception model for a normal class image. The last three columns show the original image, activation map, and overlay image of the same three layers for a COVID class image. The activation layer focuses on a narrower region when searching for the COVID and normal classes. Although the convolution layer attends to a broader region, the activation layer contributes more to the decision-making process. Since the activation layer is closest to the output, it must detect the relevant features, and in our Xception model it does so. In the final stage of our model visualization for two-class classification, deep-layer feature investigation, and model explainability, we examined the heatmap of the last convolution layer of each model we used. Figure 16 displays the GradCam analysis of COVID and normal class images at the final convolution layer of the Xception, VGG19, and ResNet50 models. The first row displays chest X-ray images from the COVID and normal classes. The second row displays the GradCam activation maps for the X-ray images. The third row displays the superimposed images, in which the activation map is overlaid on the X-ray image.
The first three columns show the original image, heatmap, and overlay image from the Xception, VGG19, and ResNet50 models for a normal class image. The last three columns show the original image, activation map, and overlay image at the final convolution layer of Xception, VGG19, and ResNet50 for a COVID class image. The final convolution layers of the Xception, VGG19, and ResNet50 models are block14 sep conv 2, block5 conv 4, and conv5 block 3_3 conv, respectively. From the images shown, the VGG19 and Xception models focus more on high-level features when searching for the relevant regions. For the normal class, all three models look at the relevant region and all three predict the class correctly. For the COVID class, however, VGG19 misclassified the image; the heatmap shows that, although it attends to the chest region, its prediction was driven by areas outside the chest.
Figure 16. GradCam assessment of COVID and normal class images at the final convolution layer of the Xception, VGG19, and ResNet50 models. The first row, images (a-f), shows chest X-ray images from the normal and COVID classes. The second row, images (g-l), shows the GradCam activation maps for the X-ray images. In the third row, images (m-r), the activation map is overlaid on the X-ray image to produce the superimposed image. The first three columns show the original image, heatmap, and overlay image from the final convolution layer of the Xception, VGG19, and ResNet50 models for a normal class image. The last three columns show the original image, activation map, and overlay image from the final convolution layer of Xception, VGG19, and ResNet50 for a COVID class image. The final convolution layers of the Xception, VGG19, and ResNet50 models are block14 sep conv 2, block5 conv 4, and conv5 block 3_3 conv, respectively.
GradCam analysis of the four-class images is shown in Figures 17 and 18.
Explainable AI at the final convolution layer of the Xception model is visualized in Figure 17. The first row shows chest X-ray images of the different classes. The second row displays the GradCam activation maps for the X-ray images. The third row shows the superimposed images, in which the activation map is overlaid on the X-ray image. The first column shows the input image, heatmap, and overlay image for the normal class at the final convolution layer of the Xception model. The second, third, and fourth columns show the original image, activation map, and superimposed image for the COVID, lung opacity, and non-COVID viral pneumonia classes, respectively. All images were correctly classified in this case, and we can see that the model attends to the region of interest to differentiate the normal, COVID, lung opacity, and non-COVID viral pneumonia classes. Figure 18 shows that VGG19 misclassified the COVID and non-COVID viral pneumonia images because its last layer attends to points outside the desired region. However, the Xception and ResNet50 models classified the images correctly in this case.
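The superimposed images in the third row of each figure combine a low-resolution activation map with the full-resolution X-ray. As a rough illustration of how such an overlay can be produced, the sketch below upsamples the heatmap by nearest-neighbor lookup and blends it with a grayscale image. This is a simplified assumption rather than the procedure used in the study: the function names, the blending weight `alpha`, and the grayscale (rather than color-mapped) output are all hypothetical choices.

```python
from typing import List

Matrix = List[List[float]]


def upsample_nearest(heatmap: Matrix, out_h: int, out_w: int) -> Matrix:
    """Resize a heatmap to (out_h, out_w) using nearest-neighbor lookup."""
    in_h, in_w = len(heatmap), len(heatmap[0])
    return [
        [heatmap[i * in_h // out_h][j * in_w // out_w] for j in range(out_w)]
        for i in range(out_h)
    ]


def superimpose(image: Matrix, heatmap: Matrix, alpha: float = 0.4) -> Matrix:
    """Blend a [0, 1] heatmap onto a [0, 1] grayscale image.

    alpha is the weight of the heatmap; 1 - alpha is the weight of the image.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    height, width = len(image), len(image[0])
    resized = upsample_nearest(heatmap, height, width)
    return [
        [alpha * resized[i][j] + (1.0 - alpha) * image[i][j] for j in range(width)]
        for i in range(height)
    ]
```

A larger `alpha` makes the attended regions stand out more strongly but hides more of the underlying anatomy, so the value is a visual trade-off rather than a model parameter.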

Discussion
We utilized three pre-trained models for training: Xception, VGG19, and ResNet50. For binary classification we considered two image sizes, 299 × 299 and 512 × 512. With 512 × 512 images, we reached accuracies of 97.5%, 97.5%, and 93.3% for Xception, VGG19, and ResNet50, respectively. Figure 10 presents the ROC curve and the confusion matrix for each binary classification model trained with 512 × 512 images. Based on the performance metrics, the Xception model performs best because of its precise feature extraction mechanism in the entry, middle, and exit flows. The confusion matrix shows that it classifies COVID-19 images almost flawlessly. Only 0.0025% of COVID-19 images are misclassified as normal X-ray images. On the other hand, 0.022% of normal images are misclassified as COVID-19 images. With 299 × 299 inputs, our Xception, VGG19, and ResNet50 models reached accuracies of 97%, 97%, and 92.5%, respectively. Figure 11 depicts the ROC curve and the confusion matrix for the models trained with 299 × 299 images. Here, once again, Xception turned out to be the most efficient model. It classifies COVID-19 images more accurately than the other models. However, VGG19 excelled at classifying normal images: it has the lowest false-positive value when classifying normal X-ray images.
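The misclassification figures quoted above are read off a binary confusion matrix. As a small illustration of how such rates are derived, the sketch below computes accuracy, the share of COVID-19 images predicted as normal, and the share of normal images predicted as COVID-19 from raw counts. The function name and the counts in the usage note are hypothetical and are not taken from Figures 10 or 11.

```python
from typing import Dict


def binary_rates(tp: int, fn: int, fp: int, tn: int) -> Dict[str, float]:
    """Summarize a 2 x 2 confusion matrix, treating COVID-19 as the positive class.

    tp: COVID-19 images predicted as COVID-19
    fn: COVID-19 images predicted as normal
    fp: normal images predicted as COVID-19
    tn: normal images predicted as normal
    """
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("each class needs at least one image")
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        # Fraction of COVID-19 images misclassified as normal.
        "false_negative_rate": fn / (tp + fn),
        # Fraction of normal images misclassified as COVID-19.
        "false_positive_rate": fp / (fp + tn),
    }
```

For example, `binary_rates(tp=90, fn=10, fp=20, tn=80)` describes an illustrative model that misses one in ten COVID-19 images. The returned values are fractions, so they must be multiplied by 100 to be reported as percentages.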
After finalizing our binary classification model, we moved on to multi-class classification with VGG19, Xception, and ResNet50. Among these models, ResNet50 reached an accuracy of only 75%, while VGG19 reached 92% and Xception 93%. Figure 12 shows the ROC curves and confusion matrices of our trained multi-class models. Figure 12d shows that our trained Xception model is superior at classifying non-COVID viral pneumonia images. However, COVID-19 images have a higher chance of being misclassified as lung opacity. VGG19 produces fewer false positives. Although it classifies non-COVID viral pneumonia and normal images more accurately, it tends to misclassify COVID-19 images as the lung opacity class. Although the two models reached nearly the same accuracy, it is fair to state that Xception is the more efficient model.
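Because two models with nearly the same accuracy can differ in which classes they confuse, per-class precision, recall, and F1 are the natural way to compare them. The sketch below derives these scores from a multi-class confusion matrix whose rows are true classes and whose columns are predicted classes. It is an illustrative helper under those assumptions: the function name is hypothetical and no values from Figure 12 are reproduced.

```python
from typing import Dict, List


def per_class_metrics(
    confusion: List[List[int]], labels: List[str]
) -> Dict[str, Dict[str, float]]:
    """Compute precision, recall, and F1 for every class.

    confusion[r][c] counts images of true class r predicted as class c.
    """
    n = len(labels)
    if len(confusion) != n or any(len(row) != n for row in confusion):
        raise ValueError("confusion matrix must be square and match labels")

    metrics: Dict[str, Dict[str, float]] = {}
    for k, label in enumerate(labels):
        true_positive = confusion[k][k]
        predicted_as_k = sum(confusion[r][k] for r in range(n))  # column sum
        actually_k = sum(confusion[k])  # row sum
        precision = true_positive / predicted_as_k if predicted_as_k else 0.0
        recall = true_positive / actually_k if actually_k else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics
```

With the four classes used here, a low recall for COVID-19 together with a low precision for lung opacity would correspond to the pattern described above, in which COVID-19 images are predicted as lung opacity.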

Conclusions
This article proposes a comprehensive CNN-based transfer learning approach for classifying non-COVID viral pneumonia, lung opacity, COVID-19, and normal X-ray images. Throughout the manuscript, we have presented six models: the first three automatically recognize COVID-19, and the other three classify normal, COVID-19, lung opacity, and non-COVID viral pneumonia images. Our experiment was split into two segments. In the first segment, we considered only the binary classification of COVID-19, and in the final segment we conducted multi-class classification. We considered a variety of image sizes in our experiments and conclude that an input size of 512 × 512 pixels delivers higher accuracy, possibly because resizing images to a smaller size discards some essential information. For binary classification between COVID-19 and normal X-ray images, with the standard input size (299 × 299), the accuracies of Xception, VGG19, and ResNet50 are 97%, 97%, and 92.5%, respectively. With 512 × 512 pixels, the accuracies improved to 97.5%, 97.5%, and 93.3%. Each model achieved satisfactory AUC, precision, recall, and F1 scores, which indicates that none of the trained models is biased towards any particular class. Additionally, for multi-class classification, ResNet50 reached an accuracy of only 75%, whereas VGG19 reached 92% and Xception 93%, with satisfactory precision, recall, AUC, and F1 scores. Furthermore, we utilized Explainable AI to add interpretability to our models. This analysis revealed that, although the Xception and VGG19 models achieved nearly identical scores, Xception was superior at identifying the precise features responsible for a model's prediction.
Even though we achieved an accuracy of 93%, the model may still be unreliable in everyday use, as patients are directly affected by its predictions and even a single misclassification might have serious consequences. Thus, in the future, we intend to further improve the model's accuracy and to explore transformer-based frameworks that may reach higher accuracy. Moreover, for multi-class classification we considered only COVID-19, non-COVID viral pneumonia, normal, and lung opacity X-ray images. However, lung opacity covers a variety of other lung conditions, such as lung cancer, pulmonary edema, and pulmonary embolism. We aim to add further classes to our future model so that it can classify X-ray images beyond just lung opacity. Furthermore, since medical data requires an extra layer of privacy, we plan to develop a federated-learning-based model.